Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP expand Variables class to handle s3 urls from NSIDC #434

Closed
wants to merge 74 commits into from

Conversation

rwegener2
Copy link
Contributor

@rwegener2 rwegener2 commented Aug 1, 2023

What was done

At the moment the Variables class .avail() method would fail if given an s3 url. This PR expands the class to handle this case.

How it was done

In addition to file and order a new vartype, nsidc-s3, was added to the class. nsidc was specified (instead of just s3) because extracting the product and version from the filepath relies on the nsidc naming structure. So, you couldn't just give Variables the s3 path to any IS2 file anywhere on AWS; it needs to be from the nsidc-cumulus-prod-protected bucket. If others have feedback on this decision I'm happy to hear. We could alternately try to extract the product and version from the metadata, which would remove this limitation.

Within the .avail() method the nsidc-s3 filetype uses the same variable extraction strategy as order. An interesting note about this is that if we use the order method of grabbing variables it happens very quickly. It's the time to ping an API endpoint and parse the request. The .avail() function gets very slow with cloud data when trying to walk the full file (as happens when using the file argument), but we can avoid that by using the order method of extracting variables instead.

Todo / Next steps

This PR is pretty close, but blocked by determining how to pass authentication (#433). After auth is addressed this PR can be merged into the getvars branch, where work has already started changing the way the Variables class is called in Read and Query.

  • add test case
  • update after authentication PR
  • add parse as a library requirement

@github-actions
Copy link

github-actions bot commented Aug 1, 2023

Binder 👈 Launch a binder notebook on this branch for commit 703d832

I will automatically update this comment whenever this PR is modified

Binder 👈 Launch a binder notebook on this branch for commit 61b2006

Binder 👈 Launch a binder notebook on this branch for commit ba52c55

@@ -72,6 +74,14 @@ def __init__(
elif self._vartype == "file":
# DevGoal: check that the list or string are valid dir/files
self.path = path
elif self._vartype == "nsidc-s3":
# Grab metadata from s3 path
template = ('s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/'
Copy link
Member

@weiji14 weiji14 Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This template looks to be hardcoded to the non-gridded ATLAS datasets like ATL06. Other gridded products (e.g. ATL14, ATL16, ATL20) would have a different template like s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/{year}/{filename}' . E.g. looking at https://search.earthdata.nasa.gov/search?ff=Available%20in%20Earthdata%20Cloud&fi=ATLAS&gdf=HDF&fst0=Cryosphere&lat=65.08299573518599&long=-25.69921875&zoom=5:

Dataset Sample path
ATL06 s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2023/04/16/ATL06_20230416235213_04061911_006_02.h5
ATL07 s3://nsidc-cumulus-prod-protected/ATLAS/ATL07/005/2022/10/12/ATL07-02_20221012220720_03391701_005_01.h5
ATL10 s3://nsidc-cumulus-prod-protected/ATLAS/ATL10/005/2022/10/12/ATL10-01_20221012220720_03391701_005_01.h5
ATL11 s3://nsidc-cumulus-prod-protected/ATLAS/ATL11/005/2022/03/27/ATL11_006305_0315_005_03.h5
ATL14 s3://nsidc-cumulus-prod-protected/ATLAS/ATL14/002/2019/ATL14_IS_0314_100m_002_01.nc
ATL16 s3://nsidc-cumulus-prod-protected/ATLAS/ATL16/004/2022/ATL16_20220722003637_04601601_004_01.h5
ATL20 s3://nsidc-cumulus-prod-protected/ATLAS/ATL20/003/2022/ATL20-01_20220901002201_10861601_003_01.h5

Would it be possible to generalize this code to both non-gridded and gridded products?

Copy link
Contributor Author

@rwegener2 rwegener2 Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for catching this @weiji14! I'll work on a fix and let you know when I'm ready for another review!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question for @JessicaS11 and @weiji14 -- What do we think of getting the version and product name from inside the file instead of parsing it from the filename? I've only checked a handful of products, but those fields seem to be available in top-level metadata in a consistent way. I've been trying to parse those things out of the filename, which is how I believe it is also done elsewhere in the module, but this limits the files icepyx can process to those named in a very specific way. If we grab product/version from inside the file we are able to process more files (ex. cloud icesat-2 files not in nsidc bucket, or local files that have had their name changed). Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that the place I was thinking about this last was in the branch to remove intake from icepyx. I just pushed a WIP PR (#438) so there is a place to discuss questions. Hopefully whatever we decide there about accessing the product/version from that can be used for this PR later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion summarized in #438 (comment) indicates our intention to move away from requiring the user provide the product as input (unless they are also feeding in a directory containing files from multiple products). This should address the template issues noted here.

@rwegener2
Copy link
Contributor Author

This PR prompted two prior PRs: #444 and #451. I think those updates will are exciting for enabling the goal of this PR, allowing the Variables class to list s3 data variables. The changes in those two prior have made the approach to this goal is quite different than when this PR was opened (it was more complicated back then!). As a result I'm going to close this PR and open a new one to pursue adding s3 url reads to the Variables class.

@rwegener2 rwegener2 closed this Oct 23, 2023
@rwegener2 rwegener2 deleted the s3_variables branch October 23, 2023 21:23
@JessicaS11 JessicaS11 linked an issue Jan 24, 2024 that may be closed by this pull request
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make Variables an independent class
3 participants